
Add proposal for per-tenant cardinality API #7335

Open
CharlieTLe wants to merge 6 commits into cortexproject:master from
CharlieTLe:proposal/per-tenant-tsdb-status-api


CharlieTLe (Member) commented Mar 7, 2026

Summary

Proposal for a per-tenant cardinality API (GET /api/v1/cardinality) that exposes cardinality statistics (top metrics by series count, top labels by value count, top label-value pairs by series count) across two data sources:

  • source=head: Fans out to ingesters via the distributor and aggregates TSDB head statistics, deduplicating replicated series by dividing by the replication factor (RF).
  • source=blocks: Fans out to store gateways via BlocksFinder and GetClientsFor, and computes cardinality from block indexes with per-block caching.
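The two paths differ mainly in how per-instance results are combined. The sketch below is a minimal, hypothetical merge step (`LabelCount` and `mergeTopN` are illustrative names, not Cortex code): it sums counts by name across responses, divides by the replication factor for `source=head` because each series is stored on RF ingesters, and passes `rf = 1` for `source=blocks`, where `GetClientsFor` routes each block to exactly one store gateway. It is meant for series counts; distinct label-value counts do not divide cleanly by RF, and if each responder returns only its own top N, the merged ranking is approximate.

```go
package main

import "sort"

// LabelCount is a hypothetical (name, count) pair, e.g. one entry of the
// "top metrics by series count" list returned by a single instance.
type LabelCount struct {
	Name  string
	Value uint64
}

// mergeTopN sums counts per name across all instance responses, divides the
// totals by rf, sorts descending and truncates to limit.
//   - source=head:   rf = ingester replication factor (best-effort dedup).
//   - source=blocks: rf = 1, since each block is served by one store gateway.
func mergeTopN(responses [][]LabelCount, rf uint64, limit int) []LabelCount {
	if rf == 0 {
		rf = 1 // guard against division by zero
	}
	totals := make(map[string]uint64)
	for _, resp := range responses {
		for _, item := range resp {
			totals[item.Name] += item.Value
		}
	}
	out := make([]LabelCount, 0, len(totals))
	for name, sum := range totals {
		out = append(out, LabelCount{Name: name, Value: sum / rf})
	}
	// Descending by count; name breaks ties so the output is deterministic.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Value != out[j].Value {
			return out[i].Value > out[j].Value
		}
		return out[i].Name < out[j].Name
	})
	if limit >= 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}
```

With RF = 3 and three ingesters each reporting the same 90 series for a metric, the merged value is 270 / 3 = 90.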

Key design points:

  • start/end are required for the blocks path and rejected for the head path (head statistics cannot be filtered to a sub-range of time)
  • Per-tenant limits: cardinality_api_enabled, cardinality_max_query_range, cardinality_max_concurrent_requests, cardinality_query_timeout
  • Standard Prometheus {status, data} response envelope, with an approximated field that flags results affected by block overlap or partial responses
  • Phased rollout: head path first, blocks path second, behind a per-tenant feature flag

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>

Currently, Cortex tenants lack visibility into which metrics, labels, and label-value pairs contribute the most series in ingesters. Without this information, debugging high-cardinality issues requires operators to inspect TSDB internals directly on ingester instances, which is impractical in a multi-tenant, distributed environment.

Prometheus itself exposes a `/api/v1/status/tsdb` endpoint that provides cardinality statistics from the TSDB head. This proposal brings equivalent functionality to Cortex as a multi-tenant, distributed API.

Contributor

I am not a fan of the TSDB status API name. The Prometheus API might change and add more stuff. Would a dedicated api/v1/cardinality be better?

Member

I agree. We might not use the Prometheus TSDB in the future.


## Out of Scope

- **Long-term storage cardinality analysis**: This endpoint only covers in-memory TSDB head data in ingesters. Analyzing cardinality across compacted blocks in object storage is a separate concern. A future long-term cardinality API could reuse portable fields (see [Extensibility](#extensibility-to-long-term-storage)) or introduce a separate endpoint.

Contributor

Do we plan to have a different API for long-term storage cardinality? We should aim for the same API endpoint, even though we don't have to design for it now.

Member

I agree, we should plan for this too, probably sooner rather than later.


Expose per-tenant TSDB head cardinality statistics via a REST API endpoint on the Cortex query path. The endpoint should:

1. Be compatible with the Prometheus `/api/v1/status/tsdb` response format.

Contributor

I am not sure this needs to be one of the goals. Does it need to be compatible?
I think our API response format is already incompatible today.


- **Authentication**: Requires `X-Scope-OrgID` header (standard Cortex tenant authentication).
- **Query Parameter**: `limit` (optional, default 10) - controls the number of top items returned per category.

Contributor

How about start and end?

Member

I agree, we need start and end. Sometimes cardinality issues are specific to a time range.

```protobuf
message TSDBStatusResponse {
  uint64 num_series = 1;
  int64 min_time = 2;
  int64 max_time = 3;
  // ... (remaining fields not shown in this excerpt)
}
```

Contributor

Do we need min/max time? How do we aggregate them in the final response: min(min_t) and max(max_t)?
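If the time-range fields were kept, the reviewer's `min(min_t)` / `max(max_t)` aggregation is straightforward. A minimal sketch, assuming a hypothetical `headStats` view of the three excerpted fields with millisecond timestamps, and assuming ingesters reporting zero series are skipped because an empty head has no meaningful range. A later revision of the proposal dropped `minTime`/`maxTime` from the response entirely.

```go
package main

import "math"

// headStats is a hypothetical Go view of num_series, min_time and max_time.
type headStats struct {
	NumSeries uint64
	MinTime   int64 // milliseconds since epoch
	MaxTime   int64 // milliseconds since epoch
}

// mergeTimeRange returns the earliest MinTime and latest MaxTime across
// non-empty ingester responses; ok is false if every response was empty.
func mergeTimeRange(resps []headStats) (minT, maxT int64, ok bool) {
	minT, maxT = math.MaxInt64, math.MinInt64
	for _, r := range resps {
		if r.NumSeries == 0 {
			continue // assumption: an empty head contributes no time range
		}
		if r.MinTime < minT {
			minT = r.MinTime
		}
		if r.MaxTime > maxT {
			maxT = r.MaxTime
		}
		ok = true
	}
	return minT, maxT, ok
}
```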


2. **`chunkCount` omitted**: Prometheus includes a `chunkCount` field (from `prometheus_tsdb_head_chunks`). In a distributed system with replication, chunk counts across ingesters cannot be meaningfully aggregated — chunks are an ingester-local storage detail, and summing/dividing by the replication factor does not produce a useful number.

**Open question**: Should we adopt the `headStats` wrapper to maintain client compatibility with Prometheus tooling? The trade-off is compatibility vs simplicity — the flat format is easier to consume for Cortex-specific clients, but adopting the Prometheus format would allow reuse of existing client libraries.

Contributor

Does any Prometheus tool consume this today? Why is compatibility a concern?

| Field | Head-specific | Notes |
|---|---|---|
| `labelValueCountByLabelName` | No | Portable to block storage |
| `seriesCountByLabelValuePair` | No | Portable to block storage |
| `memoryInBytesByLabelName` | **Yes** | In-memory byte usage has no analogue in object storage |
| `minTime` / `maxTime` | **Yes** | Reflects head time range, not total storage |

Contributor

Do we need to add those head-specific fields?

CharlieTLe and others added 5 commits March 13, 2026 12:38
…ore gateways

Add source=blocks query parameter to analyze cardinality from compacted
blocks in object storage. The blocks path fans out to store gateways,
which compute statistics from block index headers (cheap label value
counts) and posting list expansion (exact series counts per metric).
Results are cached per immutable block.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
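Because blocks in object storage are immutable, per-block results can be cached indefinitely under the block's ULID. The sketch below illustrates the caching approach described in these commits: full results are cached per block and `limit` is applied only at response time, so one entry serves every limit. `blockCache`, `getOrCompute`, and `topN` are hypothetical names, and the map is unbounded, whereas a real store-gateway cache would need size-based eviction.

```go
package main

import (
	"sort"
	"sync"
)

// blockCardinality holds full, untruncated stats for one block
// (hypothetical shape: only series counts per metric name are shown).
type blockCardinality struct {
	SeriesByMetric map[string]uint64
}

// blockCache maps a block ULID string to its computed stats. Blocks are
// immutable, so entries never need invalidation (but this sketch never evicts).
type blockCache struct {
	mu      sync.Mutex
	entries map[string]blockCardinality
}

func newBlockCache() *blockCache {
	return &blockCache{entries: make(map[string]blockCardinality)}
}

// getOrCompute returns cached stats for ulid, computing them on a miss.
// The lock is not held during compute, so concurrent misses for the same
// block may compute twice; wasteful but safe for immutable data.
func (c *blockCache) getOrCompute(ulid string, compute func() blockCardinality) blockCardinality {
	c.mu.Lock()
	e, ok := c.entries[ulid]
	c.mu.Unlock()
	if ok {
		return e
	}
	e = compute()
	c.mu.Lock()
	c.entries[ulid] = e
	c.mu.Unlock()
	return e
}

// topN applies the request's limit at response time: names sorted by
// descending count (ties broken by name), truncated to limit.
func topN(counts map[string]uint64, limit int) []string {
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	if limit >= 0 && len(names) > limit {
		names = names[:limit]
	}
	return names
}
```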
…plify

Address feedback from PR cortexproject#7335 review:
- Rename endpoint from /api/v1/status/tsdb to /api/v1/cardinality
- Drop Prometheus compatibility as a goal
- Add start/end time range query parameters
- Drop head-specific fields (numLabelPairs, memoryInBytesByLabelName,
  minTime, maxTime) to unify response across both sources
- Remove API Compatibility and Field Portability sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…limit

Make start/end required for source=blocks to prevent unbounded block
scanning. Add cardinality_max_query_range per-tenant limit (default 24h)
to give operators control over the blast radius.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Critical:
- Fix blocks path aggregation: no SG RF division since GetClientsFor
  routes each block to exactly one store gateway

Significant:
- Add min_time, max_time, block_ids to store gateway CardinalityRequest
- Specify MaxErrors=0 for head path with availability implications
- Add consistency check and retry logic for blocks path
- Document RF division as best-effort approximation

Moderate:
- Wrap responses in standard {status, data} Prometheus envelope
- Change HTTP 422 to HTTP 400 for limit violations
- Add Error Responses section with all validation scenarios
- Add approximated field for block overlap and partial results
- Add Observability section with metrics
- Add per-tenant concurrency limit and query timeout
- Reject start/end for source=head instead of silently ignoring

Low:
- Add Rollout Plan with phased approach and feature flag
- Document rolling upgrade compatibility (Unimplemented handling)
- Document Query Frontend bypass
- Improve caching: full results keyed by ULID, limit at response time
- Add missing files to implementation section
- Move shared proto to pkg/cortexpb/cardinality.proto
- Rename TSDBStatus* to Cardinality* throughout
- Add limit upper bound (max 512)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
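The consistency check and retry for the blocks path can be sketched as follows: after each round, compare the block IDs that store gateways report as queried against the requested set, and re-query only the missing ones. All names are hypothetical, and the sketch does not model excluding previously failed gateways on retry. Blocks still missing after the last attempt are returned to the caller, which could then set `approximated` for a partial result.

```go
package main

// missingBlocks returns requested block IDs that no store gateway reported
// as queried. queried maps a gateway address to the block IDs it served.
func missingBlocks(requested []string, queried map[string][]string) []string {
	seen := make(map[string]bool)
	for _, ids := range queried {
		for _, id := range ids {
			seen[id] = true
		}
	}
	var missing []string
	for _, id := range requested {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

// queryWithRetry calls query for the still-missing blocks up to maxAttempts
// times and returns what was queried plus any blocks that never came back.
func queryWithRetry(blocks []string, maxAttempts int, query func(ids []string) map[string][]string) (map[string][]string, []string) {
	queried := make(map[string][]string)
	missing := blocks
	for attempt := 0; attempt < maxAttempts && len(missing) > 0; attempt++ {
		for gw, ids := range query(missing) {
			queried[gw] = append(queried[gw], ids...)
		}
		missing = missingBlocks(blocks, queried)
	}
	return queried, missing
}
```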
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
CharlieTLe changed the title from "Add proposal for per-tenant TSDB status API" to "Add proposal for per-tenant cardinality API" on Mar 13, 2026

3 participants